# Load packages
library("tidyverse")
library("kableExtra")
library("patchwork")
# Load data
load("data/r_data/data_cleaned.rda")
# Source functions
source("r_scripts/functions.R")
This is part II of a two-part series on this case study. You can find
part I here.
The final presentation of the findings can be downloaded as a Powerpoint
or PDF.
The scenario comes from a case study in Google’s Data Analytics Professional Certificate. The goal is to help Cyclistic, a fictitious bike-share company based in Chicago, convert casuals to members. The term casual refers to a user without a membership that either pays for each individual ride or purchases a day pass. Members have annual contracts - more on the specifics later. We have been tasked to identify differences between the two using historical trip data, which has been provided by Divvy. I wrote a separate Markdown in which I processed the data and a script to import it. The focus of this Markdown is to analyze the data and identify differences between members and casuals that can inform Cyclistic’s marketing strategy.
Let’s start with a quick overview of what the data looks like. If you’re interested to learn how I arrived at this data set, you can check out the data processing here.
# Glimpse
glimpse(data)
## Rows: 5,316,182
## Columns: 22
## $ ride_id <int> 1, 4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, …
## $ rideable_type <fct> classic_bike, classic_bike, classic_bike, electric_…
## $ started_at <dttm> 2021-04-12 18:25:36, 2021-04-17 09:17:42, 2021-04-…
## $ ended_at <dttm> 2021-04-12 18:56:55, 2021-04-17 09:42:48, 2021-04-…
## $ start_station_name <fct> State St & Pearson St, Honore St & Division St, Ash…
## $ start_station_id <fct> TA1307000061, TA1305000034, 16948, KA1503000069, 16…
## $ end_station_name <fct> Southport Ave & Waveland Ave, Southport Ave & Wavel…
## $ end_station_id <fct> 13235, 13235, 16948, KA1503000069, 16948, KA1503000…
## $ start_lat <dbl> 41.89745, 41.90312, 41.77937, 41.80583, 41.77937, 4…
## $ start_lng <dbl> -87.62872, -87.67394, -87.66484, -87.59248, -87.664…
## $ end_lat <dbl> 41.94815, 41.94815, 41.77937, 41.80580, 41.77937, 4…
## $ end_lng <dbl> -87.66394, -87.66394, -87.66484, -87.59266, -87.664…
## $ member_casual <fct> member, member, casual, casual, casual, casual, cas…
## $ ride_duration <dbl> 31.316667, 25.100000, 1.433333, 25.833333, 52.90000…
## $ day_of_week <ord> Mon, Sat, Sat, Tue, Mon, Sat, Sat, Sat, Tue, Sun, T…
## $ weekend <fct> weekday, weekend, weekend, weekday, weekday, weeken…
## $ week <dbl> 15, 15, 13, 14, 15, 16, 13, 16, 17, 13, 15, 14, 17,…
## $ month <ord> Apr, Apr, Apr, Apr, Apr, Apr, Apr, Apr, Apr, Apr, A…
## $ hour <int> 18, 9, 16, 16, 15, 15, 18, 18, 18, 14, 12, 17, 16, …
## $ minute <dbl> 18.416667, 9.283333, 16.466667, 16.583333, 15.36666…
## $ second <dbl> 18.426667, 9.295000, 16.472500, 16.585000, 15.38166…
## $ date <date> 2021-04-12, 2021-04-17, 2021-04-03, 2021-04-06, 20…
# Summarize data
summary(data)
## ride_id rideable_type started_at
## Min. : 1 classic_bike :3194662 Min. :2021-04-01 00:03:18.0
## 1st Qu.:1465136 electric_bike:2121520 1st Qu.:2021-06-23 21:16:06.0
## Median :2913560 Median :2021-08-19 20:51:46.5
## Mean :2896478 Mean :2021-08-28 15:30:50.2
## 3rd Qu.:4333891 3rd Qu.:2021-10-16 16:39:39.0
## Max. :5723532 Max. :2022-03-31 23:59:47.0
##
## ended_at start_station_name
## Min. :2021-04-01 00:14:29.00 Streeter Dr & Grand Ave: 64425
## 1st Qu.:2021-06-23 21:35:18.25 Wells St & Concord Ln : 41950
## Median :2021-08-19 21:10:19.00 Clark St & Elm St : 39352
## Mean :2021-08-28 15:47:39.69 Michigan Ave & Oak St : 36958
## 3rd Qu.:2021-10-16 16:58:19.50 Wells St & Elm St : 36304
## Max. :2022-04-01 00:27:32.00 (Other) :4375139
## NA's : 722054
## start_station_id end_station_name
## 13022 : 64425 Streeter Dr & Grand Ave: 63326
## TA1308000050: 41950 Wells St & Concord Ln : 42280
## LF-005 : 41657 Clark St & Elm St : 38784
## TA1307000039: 39352 Michigan Ave & Oak St : 37074
## 13042 : 36958 Wells St & Elm St : 36215
## (Other) :4369786 (Other) :4339249
## NA's : 722054 NA's : 759254
## end_station_id start_lat start_lng end_lat
## 13022 : 63326 Min. :41.64 Min. :-87.84 Min. :41.39
## LF-005 : 47484 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88
## TA1308000050: 42280 Median :41.90 Median :-87.64 Median :41.90
## TA1307000039: 38784 Mean :41.90 Mean :-87.65 Mean :41.90
## 13042 : 37074 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93
## (Other) :4327980 Max. :42.07 Max. :-87.52 Max. :42.17
## NA's : 759254
## end_lng member_casual ride_duration day_of_week
## Min. :-88.97 casual:2199527 Min. : 1.000 Mon:683543
## 1st Qu.:-87.66 member:3116655 1st Qu.: 6.567 Tue:724709
## Median :-87.64 Median : 11.400 Wed:744932
## Mean :-87.65 Mean : 16.825 Thu:727881
## 3rd Qu.:-87.63 3rd Qu.: 20.150 Fri:761306
## Max. :-87.49 Max. :239.967 Sat:888503
## Sun:785308
## weekend week month hour
## weekday:3642371 Min. : 1.0 Jul : 749866 Min. : 0.00
## weekend:1673811 1st Qu.:22.0 Aug : 745893 1st Qu.:11.00
## Median :30.0 Sep : 707872 Median :15.00
## Mean :29.6 Jun : 663994 Mean :14.21
## 3rd Qu.:38.0 Oct : 596885 3rd Qu.:18.00
## Max. :52.0 May : 478734 Max. :23.00
## (Other):1372938
## minute second date
## Min. : 0.00 Min. : 0.00 Min. :2021-04-01
## 1st Qu.:11.42 1st Qu.:11.43 1st Qu.:2021-06-23
## Median :15.55 Median :15.55 Median :2021-08-19
## Mean :14.70 Mean :14.71 Mean :2021-08-28
## 3rd Qu.:18.37 3rd Qu.:18.37 3rd Qu.:2021-10-16
## Max. :23.98 Max. :24.00 Max. :2022-03-31
##
The data set is a collection of all Cyclistic bike rides between
April 2021 and March 2022. Each observation is a bike ride and the
different variables describe various aspects of each ride. Our main
variable of interest is member_casual, which tells us
whether a trip was made by a member or a casual. It will hopefully help
us reveal important differences in how members and casuals use
Cyclistic’s rental bikes. We can group the remaining variables based on
the questions they can answer about a trip.
rideable_typeride_durationstarted_atended_atday_of_weekweekendmonthhourminuteseconddatestart_station_namestart_station_idend_station_nameend_station_idstart_latstart_lngend_latend_lngIn our analysis, we’ll mainly look at two independent variables, frequency, or number of rides, and ride duration, which we’ll compare between members and casuals, as well as across a host of other categories. We are going to start with a simple comparison between members and casuals. From there, we’ll break it up further by bicycle type (classic vs electric). Afterwards, we’ll look at how classic and electric bike usage fluctuate throughout the day, week and year in terms of both ride frequency and duration - all of course while differentiating between members and casuals. Lastly, we are going to examine where members and casuals prefer to ride their bikes.
To get a first impression of the differences between the two groups, we’ll compare some simple summary statistics.
# Number of rides and ride duration by member_casual
data %>%
group_by(member_casual) %>%
summarize(total_rides = n(),
mean_duration = mean(ride_duration),
median_duration = median(ride_duration),
min_duration = min(ride_duration),
max_duration = max(ride_duration),
sd_duration = sd(ride_duration),
mad_duration = mad(ride_duration)) %>%
mutate(percentage = 100*total_rides/sum(total_rides)) %>%
relocate(1, 2, "percentage") %>%
kable_custom(caption = "Summary statistics by group") %>%
scroll_box(width = "100%")
| member_casual | total_rides | percentage | mean_duration | median_duration | min_duration | max_duration | sd_duration | mad_duration |
|---|---|---|---|---|---|---|---|---|
| casual | 2199527 | 41.4 | 22.4 | 14.8 | 1.0 | 240.0 | 23.9 | 11.1 |
| member | 3116655 | 58.6 | 12.9 | 9.5 | 1.0 | 239.9 | 12.0 | 7.0 |
The 2.199.527 casual rides make up around 41.4% of all rides. Ride duration range is the same in both groups (after we excluded rides with a duration below 1 and above 240 minutes in the cleaning process. Casual rides take longer on average and are more dispersed. Following up on these basic differences, we can gain a more thorough understanding of ride durations by looking at their distributions using histograms, a density plot and a cumulative frequency plot.
# Ride duration histogram
histo <- data %>%
my_ggplot() +
geom_histogram(aes(x = ride_duration, fill = member_casual),
binwidth = 5, boundary = 0, color = "black") +
coord_cartesian() +
scale_x_continuous(breaks = seq(0, 240, 30), minor_breaks = seq(0, 240, 10)) +
scale_y_continuous(breaks = seq(0, 10^6, 2.5*10^5), minor_breaks = NULL,
labels = scales::comma) +
theme(axis.title.x = element_blank()) +
ylab("Number of rides") +
facet_wrap(~member_casual, ncol = 1)
# Ride duration density
densplot <- data %>%
my_ggplot() +
geom_density(aes(x = ride_duration, color = member_casual)) +
coord_cartesian() +
scale_x_continuous(breaks = seq(0, 240, 30), minor_breaks = seq(0, 240, 10)) +
theme(legend.position = "none", axis.title.x = element_blank())
# Cumulative frequency plot
cumfreq <- data %>%
my_ggplot(aes(x = ride_duration, color = member_casual)) +
stat_ecdf() +
scale_y_continuous(breaks = seq(0, 1, 0.2), labels = scales::percent) +
scale_x_continuous(breaks = seq(0, 240, 30), minor_breaks = seq(0, 240, 10)) +
xlab("Ride duration in minutes") +
ylab("% of total rides") +
theme(legend.position = "none")
# Combine plots
histo + densplot + cumfreq +
plot_layout(nrow = 3, heights = c(2, 1, 1)) +
plot_annotation(title = "Ride duration distribution")
In order to properly interpret the data, it is important to understand the pricing structure. We’ll assume that Cyclistic uses the same prices as Divvy, which can be found on https://divvybikes.com/pricing. To summarize, classic bikes cost $0.16 per minute. However, members, who pay $119 for a yearly subscription, only start paying after 45 minutes. With a day pass, which costs $15, the first three hours of each ride are free of additional charge. In order to avoid additional payment, members and day pass holders can switch bikes before the time limit is reached, at which point the timer starts over at zero. As a casual without a day pass, you pay $1 to unlock a classic bike and then start paying $0.16 per minute immediately. The term casual refers to both non-members with and without a day pass. E-bike prices are $0.16 per minute for members and $0.39 per minute plus $1 to unlock for casuals, regardless of whether they have a day pass or not. With that in mind, we can now get back to interpreting the distribution of ride durations.
Rides between 0 and 10 minutes are far less common in casuals in both absolute and relative terms. This makes sense since casuals, having to pay 1$ to unlock a bike, are probably going to think twice whether it’s worth it to pick up a bike or not and day pass holders probably bought a day pass because they are more interested in longer rides. Members on the other hand, for whom every trip shorter than 15 minutes is free of additional charge, might be more likely to pick up a bike even for short trips. The absolute frequency is still lower in casuals for ride durations between 10 and 20 minutes, however, relative frequencies are very similar between both groups. Casuals and members both have similar absolute frequencies for ride durations between 20 and 30 minutes, which means that, given their lower total number of rides, the relative frequency is higher in casuals. Then, as duration increases, the number of rides declines quicker in members, meaning that both absolute and relative frequency for those ride durations are higher in casuals. We can see in the cumulative frequency plot that around 10% of casual trips go beyond the 45 minute threshold, while the same is true for only around 1% of member trips. We can quickly calculate the exact numbers.
# Table 45 min
data %>%
group_by(member_casual) %>%
summarize(percentage_over_45min = 100*sum(ride_duration > 45) / n()) %>%
kable_custom(caption = "Percentage of rides longer than 45 minutes")
| member_casual | percentage_over_45min |
|---|---|
| casual | 10.7 |
| member | 1.7 |
I can think of two explanations for this. One is day pass holders and the other is the fact that members can switch bikes before exceeding 45 minutes to avoid any costs, while for casuals, the per minute price for bike rides actually decreases over time due to the $1 fee that needs to be paid at the start of every trip. The number of rides becomes indistinguishable from zero after around 60 minutes for members and 120-150 minutes for casuals. The higher dispersion in ride duration in casuals tells us that they are a more heterogeneous group than members. This makes perfect sense, since casuals encompass both non-members with and without a day pass. However, there might be other reasons as well.
Now that we have a basic understanding of the differences between members and casuals, we can break the same metrics up further by bicycle type.
# Summary statistics by member_casual and rideable_type
data %>%
group_by(member_casual, rideable_type) %>%
summarize(total_rides = n(),
mean_duration = mean(ride_duration),
median_duration = median(ride_duration),
min_duration = min(ride_duration),
max_duration = max(ride_duration),
sd_duration = sd(ride_duration),
mad_duration = mad(ride_duration)) %>%
mutate(percentage = 100*total_rides/sum(total_rides)) %>%
relocate(1:3, "percentage") %>%
kable_custom(caption = "Summary statistics by group and rideable type") %>%
scroll_box(width = "100%")
| member_casual | rideable_type | total_rides | percentage | mean_duration | median_duration | min_duration | max_duration | sd_duration | mad_duration |
|---|---|---|---|---|---|---|---|---|---|
| casual | classic_bike | 1235595 | 56.2 | 24.5 | 16.0 | 1.0 | 239.9 | 26.1 | 11.9 |
| casual | electric_bike | 963932 | 43.8 | 19.6 | 13.3 | 1.0 | 240.0 | 20.4 | 9.9 |
| member | classic_bike | 1959067 | 62.9 | 13.2 | 9.9 | 1.0 | 239.9 | 11.6 | 7.2 |
| member | electric_bike | 1157588 | 37.1 | 12.4 | 8.8 | 1.0 | 239.9 | 12.6 | 6.5 |
E-bike rides are slightly more common in casuals than in members. Ride durations are higher in casuals regardless of bicycle type, which tells us that day passes, which do not include e-bikes, are definitely not the only reason for the between group difference in ride duration. Also, electric bike ride durations are generally lower across both groups. Let’s see whether we can find some patterns to explain those differences.
# Ride duration histogram by rideable type and membership
data %>%
my_ggplot() +
geom_histogram(aes(x = ride_duration, fill = member_casual),
binwidth = 5, boundary = 0, color = "black") +
coord_cartesian() +
scale_x_continuous(breaks = seq(0, 240, 30), minor_breaks = seq(0, 240, 10)) +
scale_y_continuous(labels = scales::comma) +
labs(x = "Ride duration in minutes", y = "Number of rides",
title = "Ride frequency by ride duration",
subtitle = "By membership and rideable type") +
facet_grid(rows = vars(member_casual), cols = vars(rideable_type))
# Cumulative frequency plot
data %>%
my_ggplot(aes(x = ride_duration,
color = member_casual,
linetype = rideable_type)) +
stat_ecdf() +
scale_y_continuous(breaks = seq(0, 1, 0.2), labels = scales::percent) +
scale_x_continuous(breaks = seq(0, 240, 30), minor_breaks = seq(0, 240, 10)) +
labs(x = "Ride duration in minutes", y = "Number of rides",
title = "Cumulative ride frequency by ride duration",
subtitle = "By membership and rideable type")
While e-bike rides are on average shorter than classic bike rides, there don’t seem to be any particular patterns to explain those differences. It makes sense that e-bike trips tend to be shorter than classic ones since e-bikes are more expensive per minute, start charging per minute straight from the start even for members and allow most people to travel faster than on classic bikes. Hence, customers might be more careful not to use e-bikes for too long and, when used for the same routes, e-bikes should allow for shorter trip durations than classic bikes.
Even though there are clear differences between classic and e-bikes, these differences seem to be similar across members and casuals and thus do not reveal a lot about the differences between the two groups. It is important to note that the difference in e-bike ride duration between casuals and members cannot be attributed to day passes or the 45 minute threshold, since both don’t apply to e-bikes. However, members do not need to pay to unlock an e-bike, so they might again be more willing to pick it up for shorter durations. This is reflected in the far higher number of e-bike rides lasting less than 10 minutes in members.
Looking at how bike usage patterns vary over time, we are going to start with daily differences throughout the week. Differences between weekdays and weekends could be particularly insightful and it might make sense to differentiate between the two later on, when looking at hourly and monthly fluctuations throughout the day and year.
# Mean number of rides per day of the week
data %>%
group_by(day_of_week) %>%
mutate(n_days = n_distinct(date)) %>%
group_by(member_casual, rideable_type, day_of_week, weekend) %>%
summarize(n = n()/mean(n_days)) %>%
my_ggplot(aes(x = day_of_week, y = n,
fill = member_casual,
color = weekend)) +
geom_bar(stat = "identity") +
coord_cartesian(ylim = c(0, NA)) +
scale_color_manual(values = c("weekday" = "white", "weekend" = "black"),
guide = "none") +
theme(axis.text.x = element_text(face = c(rep("plain", 5), "bold", "bold"))) +
labs(x = NULL, y = "Mean number of rides",
title = "Mean number of rides per day of the week",
subtitle = "by membership and rideable type") +
facet_grid(rows = vars(member_casual), cols = vars(rideable_type))
The number of trips is relatively constant on weekdays in all four groups with around twice as many daily classic and slightly more electric bike rides in members compared to casuals. As we saw earlier, classic bike usage is only slightly higher than e-bike usage in casuals, whereas members clearly seem to prefer classic bikes over electric ones. On weekends, casuals rent notably more classic and slightly more electric bikes than during the week. In members, on the other hand, the number of trips stays constant or even slightly decreases on weekends. This could be an indication that casuals tend to rent bikes for more leisure activities. We know from the case study description, that around 30% of Cyclistic customers use their bikes to commute to work. Assuming that the majority of people work during the week rather than on weekends, the bulk of those 30% might be members. We’ll investigate this further by researching how bike rides vary over the course of a day after we compare ride durations between different days.
# Median ride duration by day of the week
data %>%
group_by(member_casual, rideable_type, day_of_week, weekend) %>%
summarize(ride_duration = median(ride_duration)) %>%
my_ggplot(aes(x = day_of_week, y = ride_duration,
fill = member_casual,
color = weekend)) +
geom_bar(stat = "identity") +
coord_cartesian(ylim = c(0, NA)) +
scale_color_manual(values = c("weekday" = "white", "weekend" = "black"),
guide = "none") +
theme(axis.text.x = element_text(face = c(rep("plain", 5), "bold", "bold"))) +
labs(x = NULL, y = "Median ride duration",
title = "Median ride duration by day of the week",
subtitle = "by membership and rideable type") +
facet_grid(rows = vars(member_casual), cols = vars(rideable_type))
Median ride durations are slightly higher on weekends regardless of membership and rideable type. This does not add a lot of insight, but makes sense, since people usually have more time to spare on weekends. They can be more relaxed on leisure rides, whereas they might be more rushed on commutes.
# Ride duration histogram weekday vs weekend
data %>%
group_by(weekend) %>%
mutate(n_days = n_distinct(date)) %>%
mutate(ride_duration = cut(ride_duration, seq(0,240,5))) %>%
group_by(member_casual, weekend, ride_duration) %>%
summarize(n = n()/mean(n_days)) %>%
mutate(ride_duration = seq(0,235,5)+2.5) %>%
my_ggplot() +
geom_col(aes(x = ride_duration, y = n, fill = member_casual), color = "black") +
coord_cartesian() +
scale_x_continuous(breaks = seq(0,240,30), minor_breaks = seq(0, 240, 10)) +
scale_y_continuous(labels = scales::comma) +
xlab("Ride duration in minutes") +
ylab("Mean daily number of rides") +
labs(title = "Ride duration histogram", subtitle = "Number of mean daily rides") +
facet_grid(rows = vars(member_casual), cols = vars(weekend))
In casuals, the number of rides increases evenly across all bins on weekend days, except for rides between 0 and 5 minutes, which stay around the same. In members, there are fewer rides below 15 minutes on weekends, but the number of longer rides is almost unchanged. The higher number of overall rides on weekends in casuals seems to be related to an even increase in the number of rides across all durations above 5 minutes. The overall slight decrease in the number of member rides on weekends seems to be mostly caused by a decrease in rides shorter than 15 minutes, which could be related to the lower number of commutes on weekends. Let’s dig even deeper and look into classic vs electric bikes. We can use a line graph to allow better comparisons between multiple groups. The following line graph is basically a histogram using a bin width of one minute.
# Ride duration distribution weekday vs weekend by rideable_type
data %>%
group_by(weekend) %>%
mutate(n_days = n_distinct(date)) %>%
ungroup() %>%
mutate(ride_duration = cut(ride_duration, seq(0,240,1))) %>%
group_by(member_casual, weekend, rideable_type, ride_duration) %>%
summarize(n = n()/mean(n_days)) %>%
mutate(ride_duration = row_number()-0.5) %>%
my_ggplot() +
geom_line(aes(x = ride_duration, y = n, color = member_casual, linetype = weekend),
alpha = 0.8) +
coord_cartesian() +
scale_x_continuous(breaks = seq(0,240,30), minor_breaks = seq(0, 240, 10)) +
scale_y_continuous(labels = scales::comma) +
xlab("Ride duration in minutes") +
ylab("Number of rides") +
ggtitle("Number of rides lasting x minutes") +
facet_wrap(~rideable_type)
This graph confirms our previous findings and also shows that, in casuals, e-bike usage only differs slightly between weekends and weekdays, whereas the difference is much bigger in classic bikes.
# Average number of rides by hour
data %>%
group_by(weekend) %>%
mutate(n_days = n_distinct(date)) %>%
group_by(member_casual, weekend, rideable_type, hour) %>%
summarize(n = n()/mean(n_days)) %>%
my_ggplot(aes(x = hour, y = n,
color = member_casual,
linetype = rideable_type)) +
geom_point() +
geom_line(aes(group = interaction(member_casual, rideable_type))) +
scale_x_continuous(breaks = seq(0,20,4),
labels = c("12am", "4am", "8am", "12pm", "4pm", "8pm")) +
coord_cartesian(ylim = c(0, NA)) +
labs(x = NULL, y = "Mean number of rides",
title = "Mean number of rides by hour",
subtitle = "by membership and rideable type") +
facet_wrap(~weekend, nrow = 1)
On weekdays, there are two distinct peaks in the number of rides in members. These happen between around 6am and 9am and between 3pm and 7pm. They are very likely caused by commutes. The same peaks are not visible in casuals, although they too display a sharp increase in rides after 3pm, but only a very minor one in the morning. This reinforces our hypothesis that members are more likely to rent bikes to commute to work. Other than these peaks, particularly the one in the morning, the fluctuations look similar between groups. This suggests that there might be certain members and casuals that display similar user habits, on top of which there are unique members that are responsible for the differences between both groups (alias the the morning peak). Weekend usage patterns are almost identical between members and casuals. Differences and similarities between both groups are independent of bicycle type.
# Ride duration by time of day
data %>%
group_by(member_casual, rideable_type, hour, weekend) %>%
summarize(ride_duration = median(ride_duration)) %>%
my_ggplot(aes(x = hour, y = ride_duration,
color = member_casual,
linetype = rideable_type,
group = interaction(member_casual, rideable_type))) +
geom_point() +
geom_line() +
scale_x_continuous(breaks = seq(0,20,4),
labels = c("12am", "4am", "8am", "12pm", "4pm", "8pm"),
minor_breaks = 0:23) +
scale_y_continuous(labels = scales::comma) +
coord_cartesian(ylim = c(0, NA)) +
labs(x = NULL, y = "Median ride duration",
title = "Median ride duration by time of the day",
subtitle = "by membership and rideable type") +
facet_wrap(~weekend, ncol = 2)
There is a distinct dip in ride duration on weekdays after 3am. People probably want to get home (or wherever they want to get to) as quickly as possible, causing them to go fast and possibly even get bikes for short distances that they might otherwise walk. This might also be related to other means of public transport being limited during this time of day. Between 6am and 8am, a steep increase in member ride duration causes ride duration to become almost identical between both groups. It is possible that, regardless of group, pretty much every trip in this time period is a commute with similar distances and thus durations in both groups. After this short period in which casuals and members form a relatively homogeneous group, their usual differences show again in higher ride duration in casuals. In the afternoon, the higher number of overall rides means that commuters do not have the same impact on mean and median ride duration that they have in the morning when there are barely any other rides.
# Mean number of daily rides by month
data %>%
group_by(month, weekend) %>%
mutate(n_days = n_distinct(date)) %>%
group_by(member_casual, weekend, rideable_type, month) %>%
summarize(n = n()/mean(n_days)) %>%
my_ggplot(aes(x = month, y = n,
color = member_casual,
linetype = rideable_type)) +
geom_point() +
geom_line(aes(group = interaction(member_casual, rideable_type))) +
coord_cartesian(ylim = c(0, NA)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
labs(x = NULL, y = "Mean daily number of rides",
title = "Mean daily number of rides by month",
subtitle = "by membership and rideable type") +
facet_grid(rows = vars(rideable_type), cols = vars(weekend))
As expected, both members and casuals are most active during the summer and least active in the winter. E-bike rides follow a slightly different distribution. While the summer is, overall, still more popular than other seasons, there are also peaks March and October. Patricularly members tend to use e-bikes on weekdays at least as much from October to November as they do from June to September. This could mean that some commuters switch to e-bikes when it gets cold. Since members pay for the whole year, it is no surprise that their monthly fluctuations in bike usage are lower than those of casuals. This difference is least prominent for classic bikes on weekdays, even though this can partly be explained by members switching to e-bikes during the colder months. The biggest difference can be seen on weekends, where bike usage is higher in casuals only between May and September (October for e-bikes).
# Ride duration by month
data %>%
group_by(member_casual, rideable_type, month, weekend) %>%
summarize(ride_duration = median(ride_duration)) %>%
my_ggplot(aes(x = month, y = ride_duration,
color = member_casual,
linetype = rideable_type,
group = interaction(member_casual, rideable_type))) +
geom_point() +
geom_line() +
scale_y_continuous(labels = scales::comma) +
coord_cartesian(ylim = c(0, NA)) +
labs(x = NULL, y = "Median ride duration",
title = "Median ride duration by month",
subtitle = "by membership, rideable type and type of day") +
facet_wrap(~weekend, ncol = 2)
Ride duration is highest in spring in all groups. I don’t see any interesting patterns or interactions.
Here is a summary of ride frequency and duration:
Other than when members and casuals use our bikes, we can also look at differences in terms of where they use them. We could use these insights to develop a marketing campaign e.g. by advertising or planning events at specific stations. In order to do so, let’s look at the most popular stations. We’ll also keep differentiating between weekdays and weekends, as we now know that there are substantial differences between the two in terms of customer behavior. In order to do all that, I’ll use Tableau Public. Without a credit card, I cannot enable the Google Maps API, which makes it a lot harder to create a good looking map in R. Accordingly, I chose to use Tableau and paste the images here. You can find the original Visualizations here and here. And this is the script I used to transform the data for use in Tableau.
The differences between weekdays and weekends are more pronounced in members. This makes perfect sense since the high number of commutes during the week means that there is going to be a larger difference compared to weekends, when the number of commutes is significantly lower. This is further supported by the higher similarities in stations between members and casuals on weekends compared to weekdays. A lot of the casuals’ favorite stations, as well as some stations that are popular among members on weekends, are located around the coast and parks, while the stations most frequently used by members (especially during the week) are predominantly located in Chicago’s business district.
It seems that casuals behave similarly on weekends as they do on weekdays, as they preferably use Cyclistic bikes for leisure time activities on both. Members, on the other hand, show larger differences between weekends and weekdays, indicating that during the week, they primarily use our bikes to commute, while on weekends, they rent bikes for leisure time activities, similarly to casuals.
We can also look at people’s favorite trips rather than just stations.
This comparison between trips reflects a lot of the same patterns as the one between stations. Furthermore, we can see that round trips are a lot more popular among casuals, adding even more support to our previous hypothesis, given that a round trip cannot be a commute. There are three distinct areas in which all of the 20 most frequent weekday trips occur in members. All three are university campuses and each of the trips is performed between 3 and 14 times per weekday. Apparently, Cyclistic is highly popular among university students and/or professionals. I was going to check how popular those trips and stations are among casuals and then possibly suggest partnering with universities for promotion. However, I checked Divvy’s website and found that they already have partnerships with various universities. It is possible though, that those partnerships are not very well advertised, making them a possible target for our campaign. Let’s see how these routes are doing among casuals.
# Top 20 weekday trips for members
top20 <- data %>%
filter(!is.na(start_station_id) & !is.na(end_station_id) &
member_casual == "member" & weekend == "weekday") %>%
group_by(member_casual, weekend, start_station_id, end_station_id) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
ungroup() %>%
slice_head(n = 20) %>%
select(start_station_id, end_station_id)
# Trips in other groups
top20 <- data %>%
group_by(weekend) %>%
mutate(n_days = n_distinct(date)) %>%
inner_join(top20, by = c("start_station_id", "end_station_id")) %>%
group_by(member_casual, weekend, start_station_id, end_station_id) %>%
summarize(n = n(),
n_daily = round(n/mean(n_days), 1)) %>%
arrange(desc(n)) %>%
mutate(n = paste0(n, " (", n_daily, ")")) %>%
select(-n_daily) %>%
group_by(member_casual, weekend) %>%
slice_head(n = 20) %>%
pivot_wider(names_from = c("member_casual", "weekend"), values_from = c("n"))
top20 %>%
kable_custom(caption = "Total (mean daily) rides for university routes") %>%
scroll_box(width = "100%", height = "300px")
| start_station_id | end_station_id | casual_weekday | casual_weekend | member_weekday | member_weekend |
|---|---|---|---|---|---|
| KA1503000014 | KA1504000076 | 722 (2.8) | 218 (2.1) | 3534 (13.5) | 1050 (10.1) |
| KA1504000076 | KA1503000014 | 639 (2.4) | 241 (2.3) | 3030 (11.6) | 985 (9.5) |
| TA1309000037 | KA1503000071 | 466 (1.8) | 112 (1.1) | 1035 (4) | 316 (3) |
| KA1503000014 | KA1503000071 | 448 (1.7) | 189 (1.8) | 3045 (11.7) | 1025 (9.9) |
| KA1503000071 | TA1309000037 | 447 (1.7) | 120 (1.2) | 1154 (4.4) | 271 (2.6) |
| KA1503000071 | KA1503000014 | 364 (1.4) | 134 (1.3) | 2989 (11.5) | 969 (9.3) |
| TA1309000063 | KA1503000071 | 220 (0.8) | 52 (0.5) | 989 (3.8) | 155 (1.5) |
| KA1503000071 | TA1309000063 | 171 (0.7) | 55 (0.5) | 999 (3.8) | 142 (1.4) |
| TA1307000130 | 13332 | 154 (0.6) | 21 (0.2) | 1662 (6.4) | 200 (1.9) |
| 13332 | TA1307000130 | 137 (0.5) | 23 (0.2) | 1807 (6.9) | 224 (2.2) |
| 13217 | 13216 | 101 (0.4) | 32 (0.3) | 1787 (6.8) | 405 (3.9) |
| KA1503000014 | TA1309000011 | 98 (0.4) | 30 (0.3) | 1174 (4.5) | 178 (1.7) |
| 13332 | TA1307000121 | 78 (0.3) | 14 (0.1) | 1155 (4.4) | 320 (3.1) |
| TA1309000011 | KA1503000014 | 73 (0.3) | 29 (0.3) | 1112 (4.3) | 163 (1.6) |
| 13216 | 13217 | 68 (0.3) | 22 (0.2) | 1700 (6.5) | 382 (3.7) |
| 13332 | TA1309000064 | 61 (0.2) | 14 (0.1) | 898 (3.4) | 221 (2.1) |
| TA1307000121 | 13332 | 41 (0.2) | 28 (0.3) | 999 (3.8) | 294 (2.8) |
| TA1309000064 | 13332 | 38 (0.1) | 13 (0.1) | 1007 (3.9) | 192 (1.8) |
| TA1307000139 | 13216 | 35 (0.1) | 12 (0.1) | 1134 (4.3) | 286 (2.8) |
| 13216 | TA1307000139 | 29 (0.1) | 13 (0.1) | 1130 (4.3) | 270 (2.6) |
With the most popular route amounting to 722 weekday rides over the course of the year, or around 3 per weekday, there does not seem to be a lot of potential in casuals at universities. There might be huge potential in universities overall since, even though the 20 most popular trips among members are all on university campuses, the numbers are still relatively low compared to the number of enrolled students. However, this is beyond our scope, as our aim is converting casuals not members, rather than recruiting entirely new customers.
Here is a short summary of our geographic analysis:
If we want to use our knowledge of the most popular stations to organize events, it is important to know when these stations are most busy. Even though we know when the majority of rides take place in general, we cannot be sure whether that is true for each individual station. When aiming to choose stations from which we can convert the highest number of members, there are two things we need to look for:
While the first one is straightforward, the second one is a lot harder to assess. What makes a person likely to sign up for a membership? Unfortunately, we don’t have any data on our members, so we’ll have to speculate using common sense. Locals will be more likely to become members than people from outside of Chicago. The same is true to a lesser degree for outsiders that regularly visit Chicago. These people might be professionals working in Chicago or regularly going there for business trips, or they might be tourists from the surrounding area that visit regularly. Tourists that visit Chicago only once or twice a year, or even less often, will very likely not become Cyclistic members. We unfortunately have no access to any personal information, including addresses, of our customers. The data set we are working with contains no personal identifiers. This means that we don’t know whether a customer rented a bike more than once. For all we know, it would be possible that no two trips in our data set were made by the same customer. Without all this information, we need to use our knowledge of when a trip occurs to deduce whether a customer is more or less likely to be local. I believe it is not too far fetched, for example, that the percentage of non-local tourists is higher on weekends. There are more tourists in town and the number of commutes should be significantly lower. We can further use common sense and our previously attained knowledge of customer activity patterns to divide weekdays into four distinct time periods, each associated with a certain customer demographic:
If, for example, a station is highly busy in the morning, that probably means that it is used a lot by commuters. If there is a notable number of casuals among those commuters, then these are exactly the individuals that should be relatively easy to convert to members. On the other hand, if a station is mostly busy between 9am and 4pm, then that might be an indicator that it is mostly used by non-locals. However, it is important to also consider how members use a station. For example, if a station is really popular among members between 9am and 4pm, it is more likely that the casuals using it at the same time are inclined to sign up for a membership, as they apparently share some characteristics with a lot of members. Overall, if a station is used similarly by members and casuals, that is something to keep an eye on. Furthermore, we can look at monthly fluctuations at specific stations. If a station is busy year-round, it should be at least in part due to certain people using it year-round. These people should be easier to convert to members. To sum up, there are three things that we are looking for in a station in terms of the likelihood of its users to convert to members:
The first point is less important on weekends, when there don’t seem to be any distinct time periods that reflect certain types of customers. We can use the data to 1) assess the total number of casuals visiting each stations and 2) estimate how likely the casuals at those stations are to convert. We obviously cannot do that for all 3334 stations, so we’ll focus on the top 10 stations that we’ve identified before. We identfied a top 10 for four subgroups. Due to the overlap between the four groups, this leaves us with 19 unique stations. We’ll even include the stations that were not in the top 10 in casuals for either weekdays or weekends. Even though their total casual visits did not place them in the top 10, the fact that they are so popular among members might mean that the casuals using them might be more likely to convert.
Let’s begin by examining overall visits by casuals and members for each of the 19 stations on weekdays and weekends. We’ll sort them by casual visits from highest to lowest.
# Find top 10 after 4pm <- in casuals
stations <- data %>%
group_by(weekend) %>%
mutate(n_days = n_distinct(date)) %>%
pivot_longer(cols = c("start_station_id", "end_station_id"),
names_to = c("type", ".value"),
names_pattern = '(^[^_]+(?=_))_(.*)') %>%
filter(!is.na(station_id)) %>%
group_by(member_casual, weekend, station_id) %>%
summarize(total_rides = n_distinct(ride_id),
mean_rides = total_rides/mean(n_days),
total_visits = n(),
mean_visits = total_visits/mean(n_days)) %>%
arrange(desc(total_rides)) %>%
mutate(number = row_number())
# Factor ordered by traffic in casuals overall
stations <- stations %>%
ungroup() %>%
filter(member_casual == "casual") %>%
mutate(id_ordered = reorder(station_id, desc(total_rides))) %>%
select(station_id, id_ordered, weekend) %>%
right_join(stations, by = c("station_id", "weekend"))
# Reorder weekday ID
stations <- stations %>%
ungroup() %>%
filter(member_casual == "casual" & weekend == "weekday") %>%
mutate(id_weekday = reorder(station_id, number)) %>%
select(station_id, id_weekday, weekend) %>%
right_join(stations, by = c("station_id", "weekend"))
# Reorder weekend ID
stations <- stations %>%
ungroup() %>%
filter(member_casual == "casual" & weekend == "weekend") %>%
mutate(id_weekend = reorder(station_id, number)) %>%
select(station_id, id_weekend, weekend) %>%
right_join(stations, by = c("station_id", "weekend"))
# Top 10 stations
top10 <- stations %>%
filter(number <= 10) %>%
ungroup() %>%
mutate(group = interaction(member_casual, weekend),
n = 1) %>%
select(station_id, group, n) %>%
pivot_wider(names_from = "group", values_from = "n", values_fill = list(n = 0))
# Plot
stations %>%
ungroup() %>%
filter(station_id %in% top10$station_id) %>%
my_ggplot(aes(x = mean_rides,
y = id_ordered,
fill = member_casual)) +
geom_col(position = position_dodge2(reverse = T)) +
scale_y_discrete(limits=rev) +
theme(axis.title.y = element_blank()) +
labs(title = "Mean daily number of rides", subtitle = "Top 10 stations") +
facet_wrap(~weekend, ncol = 2, scales = "free_x")
On weekdays, the more popular stations among casuals are generally the less popular ones among members, although there are some exceptions to this. The pattern looks a bit different on weekdays, where the most popular stations in one group enjoy average popularity in the other group. The least popular stations are mostly the same in both groups. Overall, station 13022 really stands out in that it is by far the busiest and involved in the majority of top 20 trips among casuals on both weekdays and weekends. However, its popularity among members is significantly lower. We’ll leave out station 13022 for the upcoming analyses of station station traffic by month and hour. Its traffic is so high that its inclusion would negatively affect readability for other stations by scaling the y-axis. We’ll obviously look into this station later though.
# Function to plot traffic
plot_traffic <- function (x, ids_df, stations, days_by, plot_by) {
# Summarize x
x <- x %>%
group_by(across(all_of(days_by))) %>%
mutate(n_days = n_distinct(date)) %>%
filter(start_station_id %in% stations |
end_station_id %in% stations) %>%
select(start_station_id, end_station_id, ride_id, member_casual,
weekend, month, started_at, hour, n_days) %>%
pivot_longer(cols = c("start_station_id", "end_station_id"),
names_to = c("type", ".value"),
names_pattern = '(^[^_]+(?=_))_(.*)') %>%
filter(station_id %in% stations) %>%
group_by(station_id, member_casual,
across(all_of(unique(c(days_by, plot_by))))) %>%
summarize(n = n_distinct(ride_id)/mean(n_days))
# Join to reorder stations
x <- left_join(x, ids_df, by = c("station_id", "weekend"))
# Plot
x %>%
my_ggplot(aes(x = !!sym(plot_by), y = n,
color = member_casual)) +
geom_point() +
geom_line(aes(group = member_casual)) +
coord_cartesian(ylim = c(0, NA))
}
# Plot by month for weekdays and weekends
plot_list <- list()
day_type <- c("weekday", "weekend")
titles <- c("Weekdays", "Weekends")
for (i in 1:2) {
plot_list[[i]] <- data %>%
filter(weekend == day_type[i]) %>%
plot_traffic(select(stations, station_id, weekend, starts_with("id_")),
stations = top10$station_id[top10$station_id != "13022"],
days_by = c("weekend", "month"),
plot_by = "month") +
facet_wrap(~id_ordered, ncol = 9, as.table = T) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
labs(x = NULL, y = "Mean daily rides",
subtitle = titles[i])
}
plot_list[[2]] <- plot_list[[2]] + theme(legend.position = "none")
plot_list[[1]] + plot_list[[2]] +
plot_layout(ncol = 1) +
plot_annotation(title = "Mean daily rides by month",
subtitle = "Top stations")
# Plot
plot_list <- list()
day_type <- c("weekday", "weekend")
titles <- c("Weekdays", "Weekends")
for (i in 1:2) {
plot_list[[i]] <- data %>%
filter(weekend == day_type[i]) %>%
plot_traffic(select(stations, station_id, weekend, starts_with("id_")),
stations = top10$station_id[top10$station_id != "13022"],
days_by = c("weekend"),
plot_by = "hour") +
facet_wrap(~id_ordered, ncol = 9, as.table = T) +
scale_x_continuous(breaks = seq(0,20,4),
labels = c("12am", "4am", "8am", "12pm", "4pm", "8pm")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
labs(x = NULL, y = "Mean hourly rides",
subtitle = titles[i])
}
plot_list[[2]] <- plot_list[[2]] + theme(legend.position = "none")
plot_list[[1]] + plot_list[[2]] +
plot_layout(ncol = 1) +
plot_annotation(title = "Mean hourly rides",
subtitle = "Top stations")
As a reminder, we are looking for stations that are 1) busy in the morning and/or afternoon, before/after normal working hours, 2) simultaneously busy in members and casuals, 3) busy year-round and 4) highly popular among casuals in general.
Right of the bat we can see that casual activity never peaks in the morning. However, it does peak in the afternoon in a lot of stations. An afternoon peak that is higher than the morning peak for that station likely means that a substantial number of the rides in that afternoon peaks are not commutes, but rather after-work leisure rides. The same is obviously true for a completely missing morning peak. Accordingly, in line with our previous findings, casuals seem to barely use Cyclistic bikes to commute. However, they do use them a lot for after-work leisure rides. The stations with a pronounced afternoon peak are highly interesting to us, because the proportion of locals compared to non-locals should be highest around that time. There are five stations with no true afternoon peak: 13008, 13042, 13300, 15544 and 13016. The first three stations have pretty low member activity on both weekdays and weekends. However, they are among the most visited stations by casuals and thus not totally irrelevant to us.
Station 15544 behaves completely different from the rest of the stations. It has a very distinct peak at around 1pm to 2pm on weekdays and weekends and is basically not used at all by members. It is located at the Shedd Aquarium, Adler Planetarium and Field Museum which close at 6pm, 4pm and 5pm. The area is highly touristic and thus the large majority of people using the station are very likely tourists. Overall, the number of visiting casuals is not extraordinarily high. This station is not very interesting to us. Lastly, station 13016 has very low casual activity in general and is only included because member activity is very high, mostly due to commutes. It too is thus irrelevant to us.
There is an afternoon peak in casual activity in all 13 remaining stations. For all these stations, casuals behave somewhat similar to members. However, a lot of them have very low overall casual traffic. Most of these stations could be good targets for something like poster advertising, but only a few of them are busy enough for promotional events to make sense. In my opinion, these are LF-005, TA1308000050, KA1504000135, TA1307000039, TA1308000001, 13179 and 13146. Let’s quickly examine station 13022 now.
# Plot by month for weekdays and weekends
plot_month <- data %>%
plot_traffic(select(stations, station_id, weekend, starts_with("id_")),
stations = "13022",
days_by = c("weekend", "month"),
plot_by = "month") +
facet_wrap(~weekend) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
labs(x = NULL, y = "Mean daily rides", subtitle = "By month")
# Plot by hour for weekdays and weekends
plot_hour <- data %>%
plot_traffic(select(stations, station_id, weekend, starts_with("id_")),
stations = "13022",
days_by = c("weekend"),
plot_by = "hour") +
facet_wrap(~weekend) +
scale_x_continuous(breaks = seq(0,20,4),
labels = c("12am", "4am", "8am", "12pm", "4pm", "8pm")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
labs(x = NULL, y = "Mean hourly rides", subtitle = "By hour")
plot_month + plot_hour +
plot_layout(ncol = 2) +
plot_annotation(title = "Station activity",
subtitle = "Station 13022")
We have already seen the station’s popularity in the map we’ve created earlier. It is by far the most popular station in casuals and involved in the majority of the casuals’ top 20 favorite trips on both weekdays and weekends. Its higher popularity on weekdays compared to weekends and in casuals compared to members, as well as its location on the coast, where already indicative of it being mostly touristic. The tiny morning and afternoon peaks further support this hypothesis. However, just like stations 13008, 13042 and 13300, the station remains interesting due to its extremely high overall casual usage. This leaves us with a total of eleven stations to further examine. We can split them into two groups:
If we were to hold an event, it would make sense to do so at the start of the season. Since activity peaks somewhere around June/July/August for all stations, it makes sense to hold events around May/June. Let’s see how many people we would be able to reach during the busiest time of day in those months. We’ll look at both groups of stations individually.
# Stations
stations_regular <- c("LF-005", "TA1308000050", "KA1504000135", "TA1307000039",
"TA1308000001", "13179", "13146")
# Plot
plot_list <- list()
day_type <- c("weekday", "weekend")
titles <- c("Weekdays", "Weekends")
for (i in 1:2) {
plot_list[[i]] <- data %>%
filter(weekend == day_type[i] & month %in% c("May", "Jun")) %>%
plot_traffic(select(stations, station_id, weekend, starts_with("id_")),
stations = stations_regular,
days_by = c("weekend", "month"),
plot_by = "hour") +
facet_grid(cols = vars(id_ordered),
rows = vars(month),
as.table = T) +
scale_x_continuous(breaks = seq(0,20,4),
labels = c("12am", "4am", "8am", "12pm", "4pm", "8pm")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
labs(x = NULL, y = "Mean hourly rides",
subtitle = titles[i])
}
plot_list[[2]] <- plot_list[[2]] + theme(legend.position = "none")
plot_list[[1]] + plot_list[[2]] +
plot_layout(ncol = 1) +
plot_annotation(title = "Mean hourly rides by month",
subtitle = "Regular stations")
All stations are significantly busier in June. Let’s see whether the same is true for the four more touristic stations.
# Stations
stations_tourist <- c("13022", "13008", "13042", "13300")
# Plot
plot_list <- list()
day_type <- c("weekday", "weekend")
titles <- c("Weekdays", "Weekends")
for (i in 1:2) {
plot_list[[i]] <- data %>%
filter(weekend == day_type[i] & month %in% c("May", "Jun")) %>%
plot_traffic(select(stations, station_id, weekend, starts_with("id_")),
stations = stations_tourist,
days_by = c("weekend", "month"),
plot_by = "hour") +
facet_grid(cols = vars(id_ordered),
rows = vars(month),
as.table = T) +
scale_x_continuous(breaks = seq(0,20,4),
labels = c("12am", "4am", "8am", "12pm", "4pm", "8pm")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
labs(x = NULL, y = "Mean hourly rides",
subtitle = titles[i])
}
plot_list[[2]] <- plot_list[[2]] + theme(legend.position = "none")
plot_list[[1]] + plot_list[[2]] +
plot_layout(ncol = 1) +
plot_annotation(title = "Mean hourly rides by month",
subtitle = "Touristic stations")
June is clearly busier than May in all stations. The differences are particularly pronounced on weekdays. We can try to find the best time to start a promotion by plotting the average number of visits during busiest period of the day by calendar week. We’ll use 4pm to 10pm for weekdays and 11am to 5pm for weekends for the regular stations.
# Regular stations weekly peak hours traffic
data %>%
filter(month %in% c("May", "Jun", "Jul")) %>%
group_by(weekend, week) %>%
mutate(n_days = n_distinct(date)) %>%
filter((start_station_id %in% stations_regular |
end_station_id %in% stations_regular) &
(between(hour, 16, 22) & weekend == "weekday") |
(between(hour, 11, 17) & weekend == "weekend")) %>%
pivot_longer(cols = c("start_station_id", "end_station_id"),
names_to = c("type", ".value"),
names_pattern = '(^[^_]+(?=_))_(.*)') %>%
filter(station_id %in% stations_regular) %>%
group_by(station_id, weekend, member_casual, week) %>%
summarize(n = n()/mean(n_days)) %>%
left_join(select(stations, station_id, id_ordered), by = "station_id") %>%
my_ggplot(aes(x = week, y = n)) +
geom_point(stat = "summary", fun = "mean", aes(color = member_casual)) +
geom_line(stat = "summary", fun = "mean", aes(color = member_casual)) +
labs(x = "Calendar week", y = "Mean daily rides",
title = "Mean peak hours rides by week", subtitle = "Regular stations") +
facet_grid(cols = vars(id_ordered), rows = vars(weekend))
The weekly fluctuations are huge, but we can recognize a trend that most stations display a linear increase in activity between weeks 18 and 24, after which they fluctuate around that new baseline. For the touristic stations, we’ll stick with 11am to 5pm for weekends and use 12pm to 8pm for weekdays.
# Touristic stations weekly peak hours traffic
data %>%
filter(month %in% c("May", "Jun", "Jul")) %>%
group_by(weekend, week) %>%
mutate(n_days = n_distinct(date)) %>%
filter((start_station_id %in% stations_tourist |
end_station_id %in% stations_tourist) &
(between(hour, 12, 20) & weekend == "weekday") |
(between(hour, 11, 17) & weekend == "weekend")) %>%
pivot_longer(cols = c("start_station_id", "end_station_id"),
names_to = c("type", ".value"),
names_pattern = '(^[^_]+(?=_))_(.*)') %>%
filter(station_id %in% stations_tourist) %>%
group_by(station_id, weekend, member_casual, week) %>%
summarize(n = n()/mean(n_days)) %>%
left_join(select(stations, station_id, id_ordered), by = "station_id") %>%
my_ggplot(aes(x = week, y = n)) +
geom_point(stat = "summary", fun = "mean", aes(color = member_casual)) +
geom_line(stat = "summary", fun = "mean", aes(color = member_casual)) +
labs(x = "Calendar week", y = "Mean daily rides",
title = "Mean peak hours rides by week", subtitle = "Touristic stations") +
facet_grid(cols = vars(id_ordered), rows = vars(weekend))
The patterns look exactly the same as they did in the other stations. The large fluctuations might be related to weather, specifically rainfall. To test this hypothesis, I downloaded historical weather data for the Chicago Area from https://www.weather.gov/wrh/climate?wfo=lot. You can find the script I used to import the data here. Since the weekly fluctuations are more or less the same in all stations, we can sum the number of rides for all stations and plot it against temperature and precipitation for each week.
# Load weather data
load("data/r_data/weather.rda")
# Combine all stations
stations_all <- c(stations_regular, stations_tourist)
# Plot
data %>%
filter(month %in% c("May", "Jun", "Jul")) %>%
group_by(weekend, week) %>%
mutate(n_days = n_distinct(date)) %>%
pivot_longer(cols = c("start_station_id", "end_station_id"),
names_to = c("type", ".value"),
names_pattern = '(^[^_]+(?=_))_(.*)') %>%
filter((between(hour, 16, 22) &
weekend == "weekday" &
station_id %in% stations_regular) |
(between(hour, 12, 20) &
weekend == "weekday" &
station_id %in% stations_tourist) |
(between(hour, 10, 16) &
weekend == "weekend" &
station_id %in% stations_all)) %>%
group_by(weekend, member_casual, week) %>%
summarize(n = n_distinct(ride_id)/mean(n_days)) %>%
left_join(weather %>%
group_by(week, weekend) %>%
summarize(temp = mean(temp),
precip = mean(precip)),
by = c("week", "weekend")) %>%
my_ggplot(aes(x = week, y = n)) +
geom_bar(stat = "summary", fun = "mean",
aes(y = precip*3000, fill = "precipitation (in)")) +
geom_line(aes(y = temp*25-1000, color = "temperature")) +
geom_point(aes(y = temp*25-1000)) +
geom_line(aes(color = member_casual)) +
geom_point(aes(color = member_casual)) +
geom_text(aes(y = 50,
label = ifelse(precip == 0, NA, round(precip, 2))),
size = 3) +
scale_x_continuous(breaks = seq(17, 30, 1)) +
scale_y_continuous(breaks = seq(0, 3000, 500),
sec.axis = sec_axis(~./25+40, name = "Temperature (°F)",
breaks = seq(0, 100, 20))) +
scale_color_manual(name = "",
values = c("casual" = "#F28E2A",
"member" = "#37808E",
"temperature" = "black")) +
scale_fill_manual(name = "", values = c("precipitation (in)" = "grey")) +
labs(x = "Calendar week", y = "Mean daily rides",
title = "Weekly traffic and weather") +
facet_wrap(~weekend, nrow = 1, scales = "fixed")
It looks like there is a correlation between the number of rides at our seven stations and precipitation and to a lesser degree temperature. There has been rain and unusually low temperatures in all of the least busy weeks. However, one of the busiest weekends is also the one with the second highest precipitation. It makes sense for the amount of precipitation to not be perfectly correlated with bike usage, since, apart from other factors, precipitation frequency and timing play a huge role. For example, a lot of rain on one day of the week might result in higher total precipitation than a little rain on seven days, but would probably have a smaller impact on the number of weekly bike rides. Also, nightly rainfall would likely barely affect bike usage. Let’s plot traffic during peak time by day and highlight rainy days. It makes sense to do this individually for each group of stations, since there might be different trends in terms of days of the week.
station_list <- list("Regular" = stations_regular, "Touristic" = stations_tourist)
for (i in 1:2) {
plot_list[[i]] <- data %>%
filter(month %in% c("May", "Jun", "Jul")) %>%
filter(member_casual == "casual") %>%
pivot_longer(cols = c("start_station_id", "end_station_id"),
names_to = c("type", ".value"),
names_pattern = '(^[^_]+(?=_))_(.*)') %>%
filter((between(hour, 16, 22) &
weekend == "weekday" &
station_id %in% stations_regular) |
(between(hour, 12, 20) &
weekend == "weekday" &
station_id %in% stations_tourist) |
(between(hour, 10, 16) &
weekend == "weekend" &
station_id %in% stations_all)) %>%
filter(station_id %in% station_list[[i]]) %>%
mutate(station_type = ifelse(station_id %in% stations_regular,
"Regular", "Touristic")) %>%
left_join(select(weather, date, temp, precip), by = "date") %>%
group_by(weekend, member_casual, date, station_type) %>%
summarize(n = n_distinct(ride_id),
temp = mean(temp),
precip = mean(precip, na.rm = T)) %>%
mutate(precip_bi = ifelse(precip >= 0.01, "Precipitation", "No precipitation")) %>%
my_ggplot(aes(x = date, y = n)) +
scale_color_manual(values = c("Precipitation" = "black", "No precipitation" = "white")) +
geom_bar(stat = "summary", fun = "mean",
fill = "#F28E2A", aes(color = precip_bi)) +
scale_x_date(date_breaks = "1 week", labels = scales::date_format("%W")) +
labs(x = "Calendar week", y = "Number of rides",
subtitle = names(station_list)[[i]]) +
facet_wrap(~weekend, nrow = 2, scales = "fixed")
}
plot_list[[1]] <- plot_list[[1]] + theme(axis.title.x = element_blank())
plot_list[[2]] <- plot_list[[2]] + theme(legend.position = "none")
plot_list[[1]] + plot_list[[2]] +
plot_layout(ncol = 1) +
plot_annotation(title = "Daily number of rides",
subtitle = "Effect of precipitation")
Rainfall is obviously associated with lower activity and the few exceptions are probably related to nightly or at least time-limited rainfall. 22 of 65 weekdays and 10 of 27 weekend days were rainy between May and July. Finally, let’s look at a boxplot comparing dry and rainy days.
for (i in 1:2) {
plot_list[[i]] <- data %>%
filter(month %in% c("May", "Jun", "Jul")) %>%
pivot_longer(cols = c("start_station_id", "end_station_id"),
names_to = c("type", ".value"),
names_pattern = '(^[^_]+(?=_))_(.*)') %>%
left_join(select(weather, date, temp, precip), by = "date") %>%
filter((between(hour, 16, 22) &
weekend == "weekday" &
station_id %in% stations_regular) |
(between(hour, 12, 20) &
weekend == "weekday" &
station_id %in% stations_tourist) |
(between(hour, 10, 16) &
weekend == "weekend" &
station_id %in% stations_all)) %>%
filter(station_id %in% station_list[[i]]) %>%
group_by(weekend, member_casual, date) %>%
summarize(n = n_distinct(ride_id),
temp = mean(temp),
precip = mean(precip, na.rm = T)) %>%
mutate(precip_bi = ifelse(precip >= 0.01, "rainy", "dry")) %>%
my_ggplot(aes(x = precip_bi, y = n, fill = member_casual)) +
geom_boxplot() +
theme(axis.title.x = element_blank()) +
facet_wrap(~member_casual+weekend, nrow = 1, scales = "fixed",
labeller = label_wrap_gen(multi_line=FALSE)) +
labs(y = "Daily rides", subtitle = names(station_list)[i])
}
plot_list[[2]] <- plot_list[[2]] + theme(legend.position = "none")
plot_list[[1]] + plot_list[[2]] +
plot_layout(ncol = 1) +
plot_annotation(title = "Daily number of rides at top stations",
subtitle = "May to July")
The effect of rain seems to be bigger on weekends compared to weekdays, in casuals compared to members and at touristic stations compared to regular ones. In conclusion, non-rainy days in June seem to be the best time to hold events. We could compare individual days to find out which days of the week are best suited for events. However, such a comparison would probably not be too reliable due to the small number of observations and the large effect of confounding variables such as weather and time of the year. Looking at the previous bar plot of the daily number of casual rides, Mondays seem to be a bit quieter than other weekdays at the regular stations, while Mondays and Fridays seem to be the busiest days at the more touristic stations. However, these days probably also have the highest percentage of non-local tourists. There does not seem to big a big difference between Saturdays and Sundays.
Lastly, let’s see how often each of the stations is used as a starting point and for roundtrips.
for (i in 1:2) {
plot_list[[i]] <- data %>%
filter(month %in% c("May", "Jun", "Jul")) %>%
pivot_longer(cols = c("start_station_id", "end_station_id"),
names_to = c("type", ".value"),
names_pattern = '(^[^_]+(?=_))_(.*)') %>%
filter((between(hour, 16, 22) &
weekend == "weekday" &
station_id %in% stations_regular) |
(between(hour, 12, 20) &
weekend == "weekday" &
station_id %in% stations_tourist) |
(between(hour, 10, 16) &
weekend == "weekend" &
station_id %in% stations_all)) %>%
filter(station_id %in% station_list[[i]]) %>%
mutate(station_type = ifelse(station_id %in% stations_regular,
"Regular", "Touristic")) %>%
group_by(station_id, weekend, member_casual, station_type) %>%
summarize(distinct = n_distinct(ride_id),
total = n(),
start = sum(type == "start"),
end = sum(type == "end")) %>%
ungroup() %>%
mutate(roundtrip_tot = total - distinct,
start_tot = start - roundtrip_tot,
end_tot = end - roundtrip_tot,
roundtrip_perc = 100*roundtrip_tot / distinct,
start_perc = 100*start_tot / distinct,
end_perc = 100*end_tot / distinct) %>%
pivot_longer(cols = matches("^start_|^end_|^roundtrip_"), names_sep = "_",
names_to = c("type", ".value")) %>%
mutate(type = factor(type, levels = c("start", "end", "roundtrip")),
perc = paste0(round(perc, 1), "%")) %>%
left_join(select(stations, station_id, id_ordered), by = "station_id") %>%
my_ggplot(aes(x = type, y = tot, fill = member_casual, label = perc)) +
geom_col(position = position_dodge()) +
geom_text(nudge_y = 450, size = 2) +
labs(y = "Total number", subtitle = names(station_list)[i]) +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
facet_grid(cols = vars(weekend, member_casual), rows = vars(id_ordered),
as.table = T)
}
plot_list[[2]] <- plot_list[[2]] + theme(axis.title.y = element_blank())
plot_list[[1]] <- plot_list[[1]] + theme(strip.text.y = element_text(size = 5))
plot_list[[1]] + plot_list[[2]] +
plot_layout(ncol = 2) +
plot_annotation(title = "Station usage",
subtitle = "Starting point, destination and roundtrip")
There are some differences indicating that certain stations are more popular as destinations whereas other tend to serve more as starting points. Roundtrips are more often made at touristic stations. I cannot however see any extreme patterns that should affect our strategy.
Summary:
We were told before our analysis that most customers use Cyclistic bikes for leisure activities and around 30% use them to commute. We found strong evidence that the majority of those commuters are members, while casuals use Cyclistic bikes almost exclusively for leisure activities. More precisely, there seem to be two groups of casuals: Tourists are mostly active on weekends and, to a lesser degree, on weekdays from morning to late afternoon. They particularly target central stations around popular sights. The other group of casuals are locals, who use Cyclistic bikes mostly for after-work leisure activities such as visiting parks. Even though the total number of rides made by locals seems to be lower than that made by tourists, we expect them to be a better target for our campaign, since memberships are more sensible for locals who have the opportunity to use our bikes regularly. Possibly, they could even be convinced to extend their bike usage beyond leisure activities and start using Cyclistic for their daily commute. Since June, July and August are the busiest months of the year, May and June should be the ideal time to launch an add campaign. A list of the most popular stations among local casuals and tourists will help us directly address our target group.
Unfortunately, our analysis does not come without limitations. First of all, our data do not contain any information regarding who performed each ride, other than whether they are a member or casual. Not only do we not have access to personal information such as a users place of residence, age or occupational status, but we cannot even differentiate between day-passes and single-rides, nor do we know anything about how frequently individual members and casuals rent bikes. While personal information is not available for obvious privacy reasons, it should at least be possible to connect rides to anonymous user IDs for members, possibly even for casuals, to improve the quality of our data and the inferences they allow. Without this information, the conclusions we can make are a lot more speculative as they require more inferences and assumptions.
We identified some key differences between members and casuals. First and foremost, we collected a lot of evidence that casuals use Cyclistic bikes mostly for leisure activities, while members seem to use them a lot more for commuting. Furthermore, we identified the busiest times of the day, days of the week and months. Lastly, we located the most popular stations and went into detail as to when, how and by whom they are being used. We will now have to prepare a convincing presentation to share with our stakeholders, including actionable recommendations to help develop a marketing strategy to convert casuals to members. The results of this analysis are going to form the basis for this presentation. You can download the final presentation as a Powerpoint or PDF.